LONDON BOROUGHS AND THEIR FACILITIES

Notebook Instructions

Although most of the plots are displayed in the notebook, the output of the interactive plots is not shown. For this reason, an html file of the notebook (the axes and legends are misplaced when submitting it to Github) and a folder called 'Interactive Plots' with the ordered images of the interactive plots are also provided in order to visualize these plots. To identify these plots in the notebook, an (*) will appear in the parts where an interactive plot should be displayed.

Moreover, after the scraping and cleaning process the data is stored in the files 'Table1.csv' and 'Table2.csv' for convenience. These files are also uploaded.

Introduction

"When a man is tired of London, he is tired of life; for there is in London all that life can afford" (Samuel Johnson,1777). London, one of the largest cities in the world, composed of 32 boroughs and the City of London, can offer a genuine quality of life to the people living in each of these boroughs. The urban planning of services or facilities accessible to citizens clearly improves their quality of life. This study shows that certain boroughs are under-resourced and it is important to promote the territorial rebalancing of these services or facilities. The usefulness of this work is to present the existing services and facilities each borough has, allowing investors and the public administration to increase the under-supplied services.

What will happen to London in the future, and does it have the potential for orderly and equitable development between the different boroughs?

This paper attempts to analyse, based on a series of socio-economic characteristics of each borough (population, area, population density, employment rate, average income per employee, average self-employed income, etc.), their level of facilities (schools, restaurants, shops, attractions etc.) development.

From here, by the comparison between boroughs with the same socio-economic characteristics and knowing the facilities that each one has, we analyse if it is possible any potential growth of the underprovided type of facility in certain boroughs. This study can be useful:

Research Questions

Data Acquisition

The data used in this analysis is taken from different sources and divided in two tables:

In this study, we have tried to get the most recent data but, unfortunately, this is not an easy task as the most updated data is not always available and also to obtain the desired variables many sources are used and might contain information from different years. However, from this point onwards, the data is assumed to be true.

Table I: Socio-Economic Characteristics of the Boroughs

As stated above, Table I comprises the socio-economic characteristics of the boroughs. These characteristics are obtained using the following methods:

  1. Excel: the variables are acquired from 'daytime-population-boroughs'(Sheet 2014) and 'london-borough-profiles'(Sheet Data) and can be found in the following links:

  1. Web Scraping: the variables are obtained from the following web pages:

Excel

From the excel 'daytime-population-boroughs' (Sheet: 2014), which contains information about the population of the boroughs during the day, only the characteristics of 'Workday population (employee)', 'Workday population (self-employed)', 'Day Overseas Staying visitors', 'Day Domestic Staying Visitors' and 'Day Trip Visitors' are selected for the study. Our main interest is focused in employees and tourists as they are the ones that could be expending, as the unemployed and the children might not have the capacity to spend.

Moreover, the 'london-borough-profiles' (Sheet: Data) provides numerous detail on the characteristics of the boroughs, however, only the variable 'Transport Accesibility Score' is added to Table I as the rest of the characteristics are obtained through web scraping.

After obtaining the desired variables from these two data bases, a data frame is created for each of them. These data frames are subsequently merged together in a new one called Excel.

Web Scraping

By applying the method of web scraping, the variables 'Average Employee Income', 'Average Self-Employee Income', 'Population', 'Area', 'Density', 'Percentage of population 75+', 'Percentage of pulation 16-74', 'Percentage of population 16-', 'Crime Rate (per 1000)', 'Employment Rate', 'Average Housing Price' and 'Percentage Annual Change Housing Price' are obtained. After acquiring each of them, they are stored in a data frame and merged together in a new one called Data_web_scraping.

The only thing to mention is that when obtaining the variables 'Percentage of population 75+', 'Percentage of pulation 16-74' and 'Percentage of population 16-, two web pages are used as the information for some of the boroughs is missing in the first one. Similarly, when acquiring the 'Crime Rate' variable, the entry for City of London is missing and it is scraped from a different source.

Table I after Data Acquisition

In order to obtain Table I, the data frames Excel and Data_web_scraping are merged.

Table II: Facilities of the Boroughs

In this case, Table II contains the number of different types of facilities for each of the boroughs. These features are acquired using the following methods:

Web Scraping

The variables 'Restaurants', 'Shops', 'Attractions', 'Clubs' and 'Leisure Venues' are scraped from the same webpage (stated above) and stored in a dataframe. While the characteristics 'Primary Scools', 'Secondary Schools', 'Mainstream Schools', 'Independent Schools', 'Nursery Schools', 'Infant Schools' and 'Academy Schools', are obtained from different sources (since the main one does not contain the information for some of the boroughs) and merged together in another data frame. Furthermore, for some cases the data is written in words instead of numbers, so an extra function is created in order to convert these words into their respective numbers. Finally, the two data frames are merged together in a new one called web_scraping2.

API

The Foursquare API is used to identify the venues located in each borough. This API returns a maximum number of 50 venues for a given latitude and longitude, but it provides more flexibility in terms of the number of calls per hour on a free account than other API's like Google's Places API. However, this limitation is a major inconvenience since all the boroughs approximately reach or surpass this value of 50. Furthermore, the area taken by this API is a circle, which is not the true shape for most boroughs. However, the number of venues outputted after applying the algorithm for a radius of 1000 m (smallest borough has approximately this radius) is taken as the true number of venues in our analysis, hoping that as we are interested in the comparison across boroughs a similar proportion of the facilities will hold.

Before being able to use the Foursquare API, the latitude and longitude are scraped from https://en.wikipedia.org/wiki/List_of_London_boroughs and the radius is computed. These variables are stored in a single data frame. After getting the number of pharmacies from the Foursquare API, a data frame is created containing the columns: 'Borough'(borough name), 'Name'(pharmacy name) and 'Adress'. A further variable 'Number of Stores' is created in order to sum up all the pharmacies later on. A similar process is done for the number of hotels, cinemas and theatres.

API Data Cleaning

After obtaining these data frames, a process of reshaping the data needs to be applied before being able to merge the data acquired from the API. For the data frames 'Pharmacy', 'Hotel', 'Cinema' and 'Theatre' the followig steps are taken:

  1. Join the 'Cinema' data frame and 'Theatre' together by row in a new data frame called 'Cinema_and_Theatre' (as there is some overlapping between them). Keep the data frames 'Pharmacy' and 'Hotel' unchanged in this step.

  2. Drop the duplicates keeping the first one.

  3. Sum the number of venues by borough using the Groupby function.

  4. Rename the column 'Number of Stores' to 'Number of Pharmacies', 'Number of Hotels' and 'Number of Cinemas/Theatres' for each data frame, respectively.

  5. Group all these data frames into a single one called API.

  6. Replace all the 'Na' values of this new data frame API by 0.

Table II after Data Acquisition

In order to obtain Table II, the data frames web_scraping2 and API are merged.

Data Cleaning

Although some data cleaning is done when acquiring the data (especially in the API section), we further continue to clean the tables in order to have a finalised version of the data. Note that when doing some of the plots further in the notebook, there might be some extra cleaning required to fulfil our goals, but the main part is done in this section. Additionally, the tables (Table I and Table II) are saved for convenience.

Table I: Socio-Economic Characteristics of the Boroughs

The following cleaning steps are taken for Table I:

  1. Convert the columns from 'str' to 'int' and 'float'.

  2. Add a new column with name 'Workday Population' by summing the columns 'Workday population (employee)' and 'Workday population (self-employed)'.

  3. Add a new column called 'Day Visitors' by summing the columns 'Day Overseas Staying visitors', 'Day Domestic Staying Visitors' and 'Day Trip Visitors'.

  4. Drop the columns 'Workday population (employee)', 'Workday population (self-employed)', 'Day Overseas Staying visitors', 'Day Domestic Staying Visitors', 'Day Trip Visitors' and 'Percentage Annual Change Housing Price' as they are stored in another variable or are irrelevant in our analysis.

  5. Drop possible duplicates by keeping the first one.

  6. Replace all the 'Na' values by 0 (since the analysis is done across the boroughs, if a value is missing, it is assumed to be 0 as deleting the row would make our analysis incomplete because there would be missing boroughs).

  7. Set the index name of the data frame to be 'Borough'.

Table II: Facilities of the Boroughs

The following cleaning steps are taken for Table II:

  1. Convert the columns from 'str' to 'int' (since they are just counts).

  2. Add a new column with name 'Schools' by summing the columns 'Primary Scools', 'Secondary Schools', 'Mainstream Schools', 'Independent Schools', 'Nursery Schools', 'Infant Schools' and 'Academy Schools'.

  3. Drop the columns 'Primary Scools', 'Secondary Schools', 'Mainstream Schools', 'Independent Schools', 'Nursery Schools', 'Infant Schools' and 'Academy Schools' as they are stored in another variable.

  4. Drop possible duplicates by keeping the first one.

  5. Replace all the 'Na' values by 0 (since the analysis is done across the boroughs, if a value is missing, it is assumed to be 0 as deleting the row would make our analysis incomplete because there would be missing boroughs).

  6. Set the index name of the data frame to be 'Borough'.

  7. Rename the columns 'Number of Pharmacies', 'Number of Hotels' and 'Number of Cinemas/Theatres' to 'Pharmacies', 'Hotels' and 'Cinemas/Theatres' to have a similar format across the columns.

Saved tables

This code cell can be used to run the saved tables of the data scraped.

Preliminary Analysis

Before being able to delve deeper into the analysis of grouping the boroughs by their similarities, their characteristics should be analysed to get a better understanding of the distribution of the variables, their correlations and to get a broad perspective of the differences between the boroughs and the number of facilities each of them has. The features of Table I and Table II are analysed in the following sections.

Table I: Socio-Economic Characteristics of the Boroughs

Table I has 14 variables, so in order to get an overall picture of the variables distribution, the data is standardized and displayed in an interactive boxplot below (*) . As it is interactive, it enables you to zoom in to have a better look at a variable's boxplot or select multiple boxplots of the desired variables by clicking on the legend.

From the plot, it can be seen that the variables 'Average Public Transport Accesibility Score', 'Average Self-Employee Income', 'Percentage of Population 75+' and 'Crime Rate (per 1000)' seem to have a higher variation than the rest. While the features of 'Average Employee Income', 'Population', 'Employment Rate' and 'Percentage of Population 75+' have the lowest variations.

Furthermore, when analysing the symmetry of the boxplots, the 'Average Employee Income', 'Percentage of Population 75+' and the 'Employment rate' seem to be symmetric. However, the rest are either right skewed (the median is closer to the first quartile) or left skewed (the median is closer to the third quartile). The right skewness (left skewness) implies that most of the observations achieve the lower (higher) values. Variables like 'Average Self-Employee Income' and 'Average Housing Price' are right skewed, which is reasonable as the housing prices and self-employee income can vary significantly depending on the borough. On the other hand, variables like the 'Average Public Transport Accesibility Score' and 'Day Visitors' are left skewed, which means that there are few boroughs with low values of these variables. Hence, the services can be improved in these boroughs. Finally, it seems like across the boxplots there is always an outlier, this could be the City of London as it differs from the rest of the boroughs.

After having an overview of the different variables, we analyze this variables further by taking a look at them in groups. The 'Population' and 'Percentage of Population' are analysed first, then the 'Crime Rate', followed by the 'Average Employee and Self-Employee Income', 'Employment Rate' and 'Average Housing Price'. Finally, the 'Workday Population', 'Day Visitors', 'Density of the Population' and 'Average Accesibility Score' are inspected.

Population and Percentage of Population for different age groups

It is surprising to see from the first plot below how the City of London, although it is one that has more visitors and workday population during the day, has the lowest population by a very significant difference. On the contrary, Barnet and Croydon, which are on the outskirts, have the highest population. This fact can be explained to the high cost of living in the center of London.

Moreover, from the second plot below it can be observed that the largest percentage of the population is between the ages of 16 and 74, as it agglutinates most of the ages. However, eventhough the percentage for the population above 75 years of age is only 6.4%, this could increase significantly during the following years due to the increase in life expectancy and decrease in birth rates.

Crime Rate (per 1000)

In order to compare the 'Crime Rate (per 1000 residents)', a pie plot (*) is displayed to see which are the safest and most dangerous boroughs. As this plot is interactive, the 'Crime Rate (per 1000)' and the borough name are also shown. Furthermore, by clicking on the legend some of the boroughs can be taken off the plot. This is useful if we want to make smaller subgroup comparisons between the boroughs.

The boroughs with the highest crime rate are City of London and Westsminster with a value of 612 and 281.7, respectively. But what it is more important to highlight, is their big difference with respect to the next borough with the highest crime rate, which is Kensington and Chelsea, with a rate of 155.7. The safest boroughs are Richmond Upon Thames and Harrow. However, all the borrows except City of London have a low crime rate.

Average employee and self-employee income, Employment Rate and Average Housing Price

The next thing to explore is the relation between the 'Average Employee Income' and the 'Average Self-Employee Income' with respect to the 'Employment Rate' and the 'Average Housing Price'. The expected result would be that as the 'Average Housing Price' increases, the 'Average Employee and Self-Employee Income' will increase. However, this is not the case as the correlation for these variables is nearly zero, shown in the plot below.

The correlation coefficients of the 'Employment Rate' and the 'Average Employee and Self-Employee Income' are low, but higher than the other correlations. This was unexpected, nevertheless, it can be due to the outlier acting as a point of high leverage in this case.

Moreover, it is of our interest to inspect further the contrast between the 'Average Employee Income' and the 'Average Self-Employee Income'. To that end, a kernel density estimate plot is presented below, in order to not only see the difference but also their respective distributions. From this plot, it can be observed that the 'Average Employee Income' on average tends to be higher than the' Average Self-Employee Income'. Nevertheless, the highest values are obtained by the 'Average Self-Employee Income'.

Workday Population, Day Visitors, Density of the Population and Average Public Transport Accesibility Score

Finally to end analysing the features of Table I, let us consider the variables of 'Workday Population', 'Day Visitors', 'Density'(density of the population) and 'Average Public Transport Accesibility Score'. The first two variables are summed to create a new one called 'Day Population' and plotted in a map of the London boroughs. Similarly, a map is also displayed for the variable 'Density'. The first map shows high concentrations of people in the surrounding boroughs of the City of London, not including this one. However, in the second map the majority of population during the day is in the City of London and also in Westminster.

It would also be of our interest to see if there is any correlation between the 'Average Public Transport Accesibility Score' and 'Day Visitors' or 'Density' variables. It can be seen from the scatterplots below that there is a moderately high positive correlation between these variables. Although from the plots it seems that the correlation with the 'Density' should be higher than that with the 'Day Population', when computing the correlation, this is not the case. Having a closer look it might be the outlier the one decreasing this correlation coefficient value.

Table II: Facilities of the Boroughs

Table II comprises the facilities of the boroughs and it is not used during the clustering procedure in the Data Analysis Section below. Instead it is employed afterwards, once the boroughs are subgrouped in order to see possible potentialities of less developed boroughs which share similar characteristics to others with more facilities. As a preliminary analysis, we want to analyze the type and number of facilities in each borough and compare these with the rest of the boroughs.

First, a barplot is drawn below in order to examine the number of venues across the boroughs. The 'Restaurants' and the 'Shops' are the venues with the highest count for most of the boroughs. The ones that have the largest number of facilities are Barking and Dagenham and Barnet. City of London, eventhough it is the richest, does not have the largest number of venues due to its small area. Waltham Forest and Wandsworth are the most disadvantageous in terms of number of facilities.

Further, an interactive pie chart (*) of the facilities is plotted for each borough, to be able to determine the proportion that each venue represents in the corresponding borough. Running the first block of code after the barplot will display the pie charts of all the boroughs, but to avoid repetition, only the output of the next block of code will be shown, displaying the pie charts for the first two. They are interactive plots, hence, some of the facilities can be removed from the plot by mouse clicking the corresponding venue in the legend. These pie charts give us a good summary of the proportion of the type of facilities found in each borough and which venues are more representative in each one. Furthermore, it allow us to compare the venues of different boroughs.

Data Analysis: Clustering

Once the exploratory analysis of the data has been done, we will group the boroughs that share similar socio-economic characteristics (Table1). To be able to fulfil our goals, we use several clustering techniques. There are four main categories of clustering methods: partition-based, density-based, distribution-based and hierarchical. For this study, four different clustering algorithms are considered, one of each category to make our analysis more robust. In this way, the algorithms of k-means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Gaussian mixture model (GMM) and Agglomerative Hierarchical clustering, are selected. The techniques are then evaluated and the best one is chosen to conclude the final grouping of the boroughs.

As seen above, Table1 has 14 variables and only 33 observations, this can result in having some redundacy between the information of the variables. Let us have a first inspection by looking at the correlation matrix plotted below. The matrix shows pair of data variables with high correlations. It can be inferred from this plot that some variables should be omitted before applying the algorithms.

Principal Component Analysis (PCA)

Although the omission of redundant variables can be done by variable selection, we choose to employ PCA. It is an unsupervised dimensionality reduction technique which projects each observation onto the first few principal components, trying to preserve as much variation of the data as possible. In order to select the number of principal components to be used, first we standardize the data to make it scale-free. Then we apply PCA to the whole data and determine the number of components that explain most of the variation of the data. It can be observed from the interactive plot below (*) that the first seven principal components explain approximately 97% of the total variation, which is more than adequate for our analysis.

Clustering Algorithms

The application of PCA before applying the clustering algorithms it is useful to avoid redundancies between the variables and to avoid possible correlations. However, there is some information lost in the process which can lead to worse results than by keeping all the variables when applying the algorithms. Due to this, we will apply each of the algorithms keeping the whole data but also after applying PCA with 7 components, to analyze if we can get similar results after the dimensionality reduction process.

Tuning the parameters

k-means, DBSCAN, GMM and Agglomerative Hierarchical Clustering are unsupervised algorithms, hence the true labels of the observations are unknown and we cannot tune the parameters by training the algorithm. For this reason, the parameters values are left to the subjective choice of the user.

The Elbow Method has been used to choose the number of clusters for the k-means algorithm, which plots the number of clusters against the sum of squared distances from the point being clustered to its cluster center. In this case the chosen tuning parameter is the place of the 'elbow' in the plot. To choose the parameter (number of clusters) for the GMM method, we will use the BIC-Score method, which plots the BIC-Score against the number of clusters. It aims to minimize this score and is a good method as it avoids overfitting by including a penalizing term.

For DBSCAN the k-distance graph is used to determine the parameter ϵ, which determines the maximum radius of each cluster. In this case, the parameter is chosen to be the 'elbow' part of the graph when plotting the index of the point against the distance of the point to its k-th nearest neighbour. In our case k is chosen to be 2, as we want at least 2 points in each cluster. Finally, for hierarchical clustering, a dendogram is displayed using the Ward’s Minimum Variance linkage and the Euclidean distance to decide the cutting point which determines the number of clusters.

The implementation of the tuning methods is displayed in the plots above. The number of clusters k ranges from 2 to 9 for the k-means and GMM algorithms. When selecting the parameters of the algorithms, we need to take into account that although having more specific subgroups is positive, too many will make single-observation clusters and this is undesirable in our study. In this way, for k-means, k-means after PCA, GMM, GMM after PCA, DBSCAN, DBSCAN after PCA, Hierarchical Clustering and Hierarchical Clustering after PCA the parameters are chosen to be 6, 6, 5, 6, 3.5, 2, 4, and 5, respectively.

Implementation of the Algorithms

In order to have a visualization of the performance of the different algorithms, PCA with 2 components is applied to the entire data and the labels obtained for each of the algorithms (k-means, k-means after PCA, GMM, GMM after PCA, DBSCAN, DBSCAN after PCA, Hierarchical Clustering and Hierarchical Clustering after PCA) are applied to the points of these plots. Before implementing the algorithms and PCA the data is standardized.

The plots of the algorithms after applying PCA with 7 components do not show a nice cluster separation, with many points overlapping, being the Hierarchical algorithm the best in this case. When using all the variables to predict the labels, GMM stands out from the rest as it is able to display clearly separable clusters with no overlapping. This is expected as when the number of clusters k=5, the BIC-Score is very low, clearly indicating that it is the optimal value for the tuning parameter. DBSCAN was also able to achieve well separated clusters, however it was an easier task because it only found two clusters (the points labelled with as -1 are outliers).

Metrics and Evaluation of the Algorithms

In order to evaluate the performance of the algorithms for this unsupervised case, three metrics are considered:

By inspecting the interactive table above (*), it can be inferred that the best score with respect to the metrics Davies Bouldin Score and Silhouette Score is achieved by DBSCAN. While k-means obtained the best Calinski Harabasz Score. The good performance by the DBSCAN algorithm can be explained by the fact that it is only able to find 2 clusters, hence the clusters were well separated. However, we would like to have different subgroups of boroughs and not just a big one containing nearly all of them.

Although the performance after applying PCA with 7 components drops with respect to applying the algorithms on the entire data set, this drop is not significant for most of the algorithms considering that half of the variables are not included in the model.

A decision must be made with respect to which algorithm to select when grouping the boroughs. Although DBSCAN seems to have the best performance as stated above, to increase the number of subgroups, k-means is chosen as it is the next best performing algorithm and it is able to divide the data into 5 subgroups.

The 'Best' Algorithm (k-means)

The labels obtained after applying the k-means to the whole data are displayed in the geographical map below in order to more clearly see which boroughs are grouped together. From the map, we can see how City of London and Westminster are not grouped with any other borough, they are the richest ones and do not even share similar characteristics between them. The boroughs surrounding the City of London are clustered together. Similarly, the outskirt boroughs are grouped together. It is curious to see how boroughs that are next to each other share similar characteristics and are clustered together.

Our main interest for this study was to see possible potentialities of less developed boroughs when comparing them with other boroughs belonging to the same group. Hence, the facilities of the clustered together boroughs are displayed in the data frames below. It can already be seen from the data frames that there are bussiness and growth opportunities for boroughs with greater capacity of development by considering their similarity to boroughs with the same label. But let us take a closer look in the next plot.

In the plots below the number and type of facilities is represented for grouped together boroughs. This plots give us a clearer picture of the potentialities of the different boroughs. In the first plot, Brent and Wandsworth are the boroughs with the greatest number of facilities and the rest of the boroughs belonging to this cluster can have large improvements, especially Sutton and Barking and Dagenham. The plots for Cluster 1 and Cluster 2 do not provide us much information other than that City of London and Westminster seem to be very different between them and with respect to the rest of the boroughs. Cluster 3 contains the boroughs which are in the center surrounding City of London and it can be seen that further improvement can be made. The last two plots are the ones for the boundary boroughs, these are less developed than other more central boroughs and hence, there are unlimited opportunities of development.

Conclusion

In this study, through data collection, data cleaning and the application of clustering algorithms, it has been possible to compare London boroughs and their facilities. Considering that the k-means algorithm is the most appropriate, we managed to subgroup the boroughs according to their socio-economic characteristics. In this way, potentialities of the less developed boroughs of each cluster can be reflected. From this, it can be concluded:

Although the study is focused on specific variables and data that can help people or entities make decisions depending on the purposes they seek, I would have liked to analyze variables and/or data that could also yield an "emotional" or "vital" assessment of the neighborhoods that could provide a less utilitarian vision and more associated with the maxim that Montaigne said, "It is fruition, and not possession, that renders us happy".